To improve the accuracy of Chinese text classification models under low hardware requirements, this paper designs an improved ensemble model that combines five different sub-models, including TextCNN, LSTM, and Bi-LSTM. Compared with existing ensemble learning methods, this model achieves higher accuracy on text classification tasks. At the same time, its hardware requirements are far lower than those of BERT-based models.
We propose a novel teacher-student model for semi-supervised multi-organ segmentation. In the teacher-student paradigm, data augmentation is usually applied to unlabeled data to regularize consistent training between teacher and student. We start from the key observation that the fixed relative locations and variable sizes of different organs provide distribution information about where a multi-organ CT scan is drawn from. Thus, we treat the prior anatomy as a strong tool to guide the data augmentation and reduce the mismatch between labeled and unlabeled images for semi-supervised learning. More specifically, we propose a data augmentation strategy based on partition-and-recovery of $N^3$ cubes across and within labeled and unlabeled images. Our strategy encourages unlabeled images to learn organ semantics in relative locations from the labeled images (cross-branch) and enhances the learning ability for small organs (within-branch). For the within-branch, we further propose to refine the quality of pseudo labels by blending the learned representations from small cubes to incorporate local attributes. Our method is termed MagicNet, since it treats the CT volume as a magic cube and the $N^3$-cube partition-and-recovery process matches the rules of playing a magic cube. Extensive experiments on two public CT multi-organ datasets demonstrate the effectiveness of MagicNet, which noticeably outperforms state-of-the-art semi-supervised medical image segmentation approaches, with a +7% DSC improvement on the MACT dataset with 10% labeled images.
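The cross-branch augmentation can be pictured as swapping location-aligned sub-cubes between a labeled and an unlabeled scan. Below is a minimal NumPy sketch of that partition-and-recovery idea as we read it from the abstract; the cube count, swap probability, and volume size are assumptions, and the authors' actual mixing rule (including the matching treatment of label masks) may differ.

```python
import numpy as np

def partition_cubes(vol, n):
    """Split a cubic volume (D, H, W) into n**3 equally sized sub-cubes."""
    d = vol.shape[0] // n
    return [vol[i*d:(i+1)*d, j*d:(j+1)*d, k*d:(k+1)*d]
            for i in range(n) for j in range(n) for k in range(n)]

def recover_volume(cubes, n):
    """Reassemble n**3 sub-cubes back into a full volume (inverse of partition)."""
    d = cubes[0].shape[0]
    vol = np.zeros((n*d, n*d, n*d), dtype=cubes[0].dtype)
    idx = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                vol[i*d:(i+1)*d, j*d:(j+1)*d, k*d:(k+1)*d] = cubes[idx]
                idx += 1
    return vol

def cross_branch_mix(labeled, unlabeled, n=3, swap_prob=0.5, rng=None):
    """Swap a random subset of location-aligned cubes between a labeled and an
    unlabeled scan, so each cube keeps its relative anatomical position.
    (A real pipeline would apply the same swap pattern to the label masks.)"""
    if rng is None:
        rng = np.random.default_rng()
    lc, uc = partition_cubes(labeled, n), partition_cubes(unlabeled, n)
    for idx in range(n ** 3):
        if rng.random() < swap_prob:
            lc[idx], uc[idx] = uc[idx], lc[idx]
    return recover_volume(lc, n), recover_volume(uc, n)

# Toy usage: two random 96^3 "CT volumes" mixed into two augmented volumes.
a, b = np.random.rand(96, 96, 96), np.random.rand(96, 96, 96)
mixed_a, mixed_b = cross_branch_mix(a, b, n=3)
```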
Pre-trained vision-language models like CLIP have recently shown superior performances on various downstream tasks, including image classification and segmentation. However, in fine-grained image re-identification (ReID), the labels are indexes, lacking concrete text descriptions. Therefore, it remains to be determined how such models could be applied to these tasks. This paper first finds out that simply fine-tuning the visual model initialized by the image encoder in CLIP has already obtained competitive performances in various ReID tasks. Then we propose a two-stage strategy to facilitate a better visual representation. The key idea is to fully exploit the cross-modal description ability in CLIP through a set of learnable text tokens for each ID and give them to the text encoder to form ambiguous descriptions. In the first training stage, the image and text encoders from CLIP are kept fixed, and only the text tokens are optimized from scratch by the contrastive loss computed within a batch. In the second stage, the ID-specific text tokens and their encoder become static, providing constraints for fine-tuning the image encoder. With the help of the designed loss in the downstream task, the image encoder is able to accurately represent data as vectors in the feature embedding space. The effectiveness of the proposed strategy is validated on several datasets for person and vehicle ReID tasks. Code is available at https://github.com/Syliz517/CLIP-ReID.
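A rough sketch of the first training stage described above: both encoders are frozen and only a set of per-identity text tokens is optimized with an in-batch contrastive loss. The toy encoders, feature dimensions, learning rate, and the assumption of distinct identities per batch are all placeholders for illustration, not the authors' implementation (see https://github.com/Syliz517/CLIP-ReID for the real code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for CLIP's image and text towers (the real method uses the
# frozen pretrained CLIP encoders and feeds the learnable tokens through the
# transformer text encoder).
class ToyImageEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 32, dim)
    def forward(self, x):                      # x: (B, 3, 32, 32)
        return F.normalize(self.proj(x.flatten(1)), dim=-1)

class ToyTextEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, token_embeds):           # (B, n_ctx, dim) learnable tokens
        return F.normalize(self.proj(token_embeds.mean(dim=1)), dim=-1)

num_ids, n_ctx, dim = 100, 4, 512
img_enc, txt_enc = ToyImageEncoder(dim), ToyTextEncoder(dim)
for p in list(img_enc.parameters()) + list(txt_enc.parameters()):
    p.requires_grad_(False)                    # stage 1: both encoders stay fixed

# One set of learnable text tokens per identity; only these are optimized.
id_tokens = nn.Parameter(torch.randn(num_ids, n_ctx, dim) * 0.02)
opt = torch.optim.Adam([id_tokens], lr=3.5e-4)

def stage1_step(images, labels, temperature=0.07):
    """One step of the in-batch image-text contrastive loss
    (simplified: assumes each image in the batch has a distinct identity)."""
    img_feat = img_enc(images)                           # (B, dim)
    txt_feat = txt_enc(id_tokens[labels])                # (B, dim)
    logits = img_feat @ txt_feat.t() / temperature       # (B, B)
    targets = torch.arange(images.size(0))
    loss = 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy batch: 8 images of 8 distinct identities.
labels = torch.randperm(num_ids)[:8]
loss = stage1_step(torch.randn(8, 3, 32, 32), labels)
```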
Without densely tiled anchor boxes or grid points in the image, Sparse R-CNN achieves promising results through a set of object queries and proposal boxes that are updated in a cascaded training manner. However, due to its sparse nature and the one-to-one relationship between each query and its attended region, it relies heavily on self-attention, which is usually inaccurate in the early training stages. Moreover, in scenes with dense objects, an object query interacts with many irrelevant ones, reducing its distinctiveness and harming performance. This paper proposes to use the IoU between different boxes as a prior for value routing in self-attention. The original attention matrix is multiplied by a matrix of the same size computed from the proposal boxes, which determines the routing scheme so that irrelevant features can be suppressed. Furthermore, to accurately extract features for both classification and regression, we add two lightweight projection heads that provide dynamic channel masks based on the object queries, and multiply them with the outputs of the dynamic convolutions, making the results suitable for the two different tasks. We validate the proposed scheme on different datasets, including MS-COCO and CrowdHuman, showing that it significantly improves performance and accelerates model convergence.
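The core idea, an IoU prior gating value routing in the queries' self-attention, can be sketched as follows. This is a single-head, projection-free toy version under our own assumptions (e.g., the row renormalization after weighting); the paper's multi-head formulation and the dynamic channel masks are not reproduced here.

```python
import torch

def pairwise_iou(boxes):
    """IoU between every pair of (x1, y1, x2, y2) boxes; returns an (N, N) matrix."""
    area = (boxes[:, 2] - boxes[:, 0]).clamp(min=0) * (boxes[:, 3] - boxes[:, 1]).clamp(min=0)
    lt = torch.maximum(boxes[:, None, :2], boxes[None, :, :2])    # top-left of intersection
    rb = torch.minimum(boxes[:, None, 2:], boxes[None, :, 2:])    # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    union = area[:, None] + area[None, :] - inter
    return inter / union.clamp(min=1e-6)

def iou_routed_self_attention(queries, boxes):
    """Self-attention over object queries whose softmax attention matrix is
    multiplied element-wise by the pairwise IoU of their proposal boxes, so
    queries whose boxes do not overlap exchange (almost) no information."""
    n, d = queries.shape
    attn = torch.softmax(queries @ queries.t() / d ** 0.5, dim=-1)      # (N, N)
    routed = attn * pairwise_iou(boxes)                                 # IoU prior as routing weights
    routed = routed / routed.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # renormalize rows
    return routed @ queries

# Toy example: 5 object queries of dim 256 with their proposal boxes.
q = torch.randn(5, 256)
b = torch.tensor([[0., 0., 10., 10.], [5., 5., 15., 15.], [50., 50., 60., 60.],
                  [0., 0., 12., 12.], [55., 52., 70., 65.]])
out = iou_routed_self_attention(q, b)
```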
Benefiting from the rich and detailed spectral information in hyperspectral images (HSIs), HSIs offer great potential for various medical applications such as computational pathology. However, the lack of adequate annotated data and the high dimensionality of HSIs often make classification networks prone to overfitting. It is therefore necessary to learn general representations that can be transferred to downstream tasks. To our knowledge, no appropriate self-supervised pre-training method has been designed for histopathology HSIs. In this paper, we introduce an efficient and effective self-supervised spectral regression (S$^3$R) method, which exploits the low-rank characteristic of the spectral domain of HSIs. More specifically, we propose to learn a set of linear coefficients that represent one band, after masking it out, by the remaining bands. The masked band is then restored by reweighting the remaining bands with the learned coefficients. Two pretext tasks are designed: (1) S$^3$R-CR, which regresses the linear coefficients, so that the pre-trained model understands the inherent structure of HSIs and the pathological characteristics of different morphologies; (2) S$^3$R-BR, which regresses the missing band, making the model learn the holistic semantics of HSIs. Compared with prior art that focuses on contrastive learning methods for natural images, S$^3$R converges at least 3 times faster and achieves up to 14% higher accuracy when transferred to HSI classification tasks.
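The spectral-regression pretext task can be illustrated with a toy example: mask one band and express it as a linear combination of the remaining bands. In the actual method the coefficients (S$^3$R-CR) or the band (S$^3$R-BR) are predicted by a backbone network during pre-training; the direct per-sample optimization below is only a hedged sketch of the objective.

```python
import torch
import torch.nn as nn

def s3r_band_regression(hsi, band_idx, steps=200, lr=1e-2):
    """Toy version of the spectral-regression pretext: represent the masked
    band `band_idx` as a learned linear combination of the remaining bands.
    hsi: (B, C, H, W) hyperspectral patches with C spectral bands."""
    B, C, H, W = hsi.shape
    target = hsi[:, band_idx]                                            # (B, H, W)
    rest = torch.cat([hsi[:, :band_idx], hsi[:, band_idx + 1:]], dim=1)  # (B, C-1, H, W)

    coeffs = nn.Parameter(torch.zeros(C - 1))        # linear coefficients to learn
    opt = torch.optim.Adam([coeffs], lr=lr)
    for _ in range(steps):
        recon = (coeffs.view(1, -1, 1, 1) * rest).sum(dim=1)   # weighted sum of other bands
        loss = ((recon - target) ** 2).mean()                  # band-regression objective
        opt.zero_grad(); loss.backward(); opt.step()
    return coeffs.detach(), loss.item()

# Toy HSI cube: batch of 2 patches, 30 bands, 16x16 spatial resolution.
coeffs, err = s3r_band_regression(torch.rand(2, 30, 16, 16), band_idx=10)
```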
The controllable person image synthesis task enables a wide range of applications through explicit control over body pose and appearance. In this paper, we propose a cross-attention based style distribution module that computes between the source semantic styles and the target pose for pose transfer. The module deliberately selects the style of each semantic representation and distributes them according to the target pose. The attention matrix of the cross attention expresses the dynamic similarity between the target pose and the source styles of all semantics. Therefore, it can be exploited to route colors and textures from the source image, and is further constrained by the target parsing map to achieve a clearer objective. Meanwhile, to encode the source appearance accurately, self-attention among the different semantic styles is also added. The effectiveness of our model is validated quantitatively and qualitatively on pose transfer and virtual try-on tasks.
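A minimal sketch of the cross-attention style distribution described above: target-pose features act as queries, per-semantic source styles act as keys/values, and an optional target parsing mask restricts which semantics each location may draw from. Shapes, the single-head formulation, and the masking rule are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def style_distribution(target_pose_feat, source_styles, parsing_mask=None):
    """Cross-attention that distributes per-semantic source styles onto
    target-pose locations.
    target_pose_feat: (L, d) flattened target-pose features (queries)
    source_styles:    (S, d) one style vector per source semantic region (keys/values)
    parsing_mask:     (L, S) optional 0/1 target-parsing constraint on which
                      semantics each target location may attend to
    """
    d = target_pose_feat.size(-1)
    attn = target_pose_feat @ source_styles.t() / d ** 0.5         # (L, S) pose-style similarity
    if parsing_mask is not None:
        attn = attn.masked_fill(parsing_mask == 0, float('-inf'))  # restrict to allowed semantics
    attn = F.softmax(attn, dim=-1)
    return attn @ source_styles                                    # routed colors/textures per location

# Toy example: 64 target locations, 8 semantic styles, feature dim 256.
mask = torch.ones(64, 8)
mask[:, 5:] = 0            # e.g. the target parsing map says only 5 semantics are present
out = style_distribution(torch.randn(64, 256), torch.randn(8, 256), parsing_mask=mask)
```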
It is a common practice to adopt ResBlocks, which learn the difference between blurry and sharp image pairs, in end-to-end image deblurring architectures. Reconstructing a sharp image from its blurry counterpart requires changes in both low- and high-frequency information. Although a conventional ResBlock may capture the high-frequency components of an image well, it tends to overlook low-frequency information. Moreover, ResBlocks usually fail to adequately model the long-range information that is non-trivial for reconstructing a sharp image from its blurry counterpart. In this paper, we present a Residual Fast Fourier Transform with Convolution block (Res FFT-Conv block), capable of capturing both long-term and short-term interactions while integrating both low- and high-frequency residuals. The Res FFT-Conv block is a conceptually simple yet computationally efficient, plug-and-play block that leads to remarkable performance gains in different architectures. With the Res FFT-Conv block, we further propose a Deep Residual Fourier Transformation (DeepRFT) framework, based on MIMO-UNet, achieving state-of-the-art image deblurring performance on the GoPro, HIDE, RealBlur, and DPDD datasets. Experiments show that DeepRFT can significantly boost image deblurring performance (e.g., a 1.09 dB improvement in PSNR on the GoPro dataset compared with MIMO-UNet), and DeepRFT+ reaches 33.23 dB PSNR on the GoPro dataset.
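A hedged PyTorch sketch of a Res FFT-Conv style block: a spatial-convolution branch for local/high-frequency residuals, a frequency-domain branch (FFT, 1x1 convolutions on real/imaginary parts, inverse FFT) for global/low-frequency residuals, and an identity shortcut. Channel widths, activations, and the exact fusion are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResFFTConvBlock(nn.Module):
    """Residual block with a spatial-convolution branch (local / high-frequency)
    and a frequency-domain branch (global / low-frequency), fused with an
    identity shortcut. A plausible reading of the abstract, not the exact code."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Frequency branch: 1x1 convolutions over concatenated real/imaginary parts.
        self.freq = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels, 2 * channels, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm='ortho')                   # (B, C, H, W//2+1), complex
        f = torch.cat([spec.real, spec.imag], dim=1)              # (B, 2C, H, W//2+1)
        f = self.freq(f)
        real, imag = f.chunk(2, dim=1)
        freq_out = torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm='ortho')
        return x + self.spatial(x) + freq_out                     # residual fusion of both branches

# Toy usage: one 32-channel 64x64 feature map through the block.
y = ResFFTConvBlock(32)(torch.randn(1, 32, 64, 64))
```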
Machine learning (ML) on graph-structured data has recently received deepened interest in the context of intrusion detection in the cybersecurity domain. Due to the increasing amounts of data generated by monitoring tools, as well as more and more sophisticated attacks, these ML methods are gaining traction. Knowledge graphs and their corresponding learning techniques, such as Graph Neural Networks (GNNs), with their ability to seamlessly integrate data from multiple domains using human-understandable vocabularies, are finding application in the cybersecurity domain. However, similar to other connectionist models, GNNs lack transparency in their decision making. This is especially important because there tend to be a high number of false positive alerts in the cybersecurity domain, so that triage needs to be done by domain experts, requiring a lot of manpower. Therefore, we address Explainable AI (XAI) for GNNs to enhance trust management by exploring the combination of symbolic and sub-symbolic methods that incorporate domain knowledge in the area of cybersecurity. We experimented with this approach by generating explanations in an industrial demonstrator system. The proposed method is shown to produce intuitive explanations for alerts across a diverse range of scenarios. Not only do the explanations provide deeper insights into the alerts, but they also lead to a reduction of false positive alerts by 66%, and by 93% when including the fidelity metric.
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning. However, such a paradigm is computationally expensive. Humans have the amazing ability to learn new visual concepts from just one single exemplar. We hereby study a new T2V generation problem$\unicode{x2014}$One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt the T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models are able to generate images that align well with the verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via an efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally coherent videos across various applications such as change of subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of our method.
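The tailored Sparse-Causal Attention can be sketched as attention in which the tokens of frame t query only the tokens of the first frame and of frame t-1. The single-head, projection-free version below is an illustrative assumption of that sparsity pattern, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def sparse_causal_attention(frames):
    """frames: (T, L, d) per-frame token sequences. For each frame t, queries come
    from frame t while keys/values come only from the first frame and frame t-1
    (frame 0 attends to itself). Single-head, projection-free sketch."""
    T, L, d = frames.shape
    outputs = []
    for t in range(T):
        q = frames[t]                                             # (L, d)
        kv = frames[0] if t == 0 else torch.cat([frames[0], frames[t - 1]], dim=0)
        attn = F.softmax(q @ kv.t() / d ** 0.5, dim=-1)           # (L, L) or (L, 2L)
        outputs.append(attn @ kv)
    return torch.stack(outputs)                                   # (T, L, d)

# Toy usage: 8 frames, 77 tokens per frame, token dim 320.
out = sparse_causal_attention(torch.randn(8, 77, 320))
```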
Generated texts from large pretrained language models have been shown to exhibit a variety of harmful, human-like biases about various demographics. These findings prompted large efforts aiming to understand and measure such effects, with the goal of providing benchmarks that can guide the development of techniques for mitigating these stereotypical associations. However, as recent research has pointed out, the current benchmarks lack a robust experimental setup, consequently hindering the inference of meaningful conclusions from their evaluation metrics. In this paper, we extend these arguments and demonstrate that existing techniques and benchmarks aiming to measure stereotypes tend to be inaccurate and contain a high degree of experimental noise that severely limits the knowledge we can gain from benchmarking language models with them. Accordingly, we propose a new framework for robustly measuring and quantifying biases exhibited by generative language models. Finally, we use this framework to investigate GPT-3's occupational gender bias and propose prompting techniques for mitigating these biases without the need for fine-tuning.